Abstract: Mixture-of-parts models have been successfully applied
to 2D human pose estimation, either as explicitly trained body
part models or as latent variables of a whole-body model. These
models usually use a tree structure to represent relations between
body parts. Tree structures make training and inference tractable
but cannot handle the double counting problem, which hinders their
use in 3D pose estimation. Whereas most work on this problem
modifies the tree model or the optimization target, we incorporate
additional cues from the input features. For example, in
surveillance environments human silhouettes can be extracted
relatively easily, although not flawlessly. In this setting, we
combine the extracted human blobs with the histogram of gradient
feature that is commonly used to train body part templates in
mixture-of-parts models. The method extends easily to other
candidate features under our generalized framework. We show 2D
body part detection results on a publicly available dataset,
HumanEva. Furthermore, a 2D to 3D pose estimator is trained with a
Gaussian process regression model, and the 2D body part detections
from the proposed method are fed to the estimator, so that 3D poses
can be predicted from new 2D body part detections. We also show 3D
pose estimation results on HumanEva.

Index Terms: Pose estimation, double counting problem, mixture
of parts model


I. Introduction 

Pose estimation from still images has wide applications in image
and video indexing, video surveillance, and human-computer
interaction. For example, an online solution to this problem can
be used for single-frame initialization when tracking human poses.
Yet pose estimation from still images, that is, 2D body part
localization, is difficult because the human body is highly
flexible, which yields poses with many degrees of freedom even in
2D images.

A state-of-the-art and widely used solution for 2D body part
detection is the mixture-of-parts (MoP) method [1], in which a
human body is modeled as a tree and body parts are encoded as its
nodes. Maximum detection responses are passed from the leaf nodes
to the root. One problem with this solution is double counting: a
single detected body part is counted twice, once for each side of
the human body. In this paper, we tackle the double counting
problem in the MoP model with multiple feature inputs. Additional
input features are incorporated so that body part localizations
can be verified against more feature responses. Compared with the
greedy solution in the original MoP method, we solve the double
counting problem with a global optimization.

Given the 2D body part locations detected by the proposed method,
we predict 3D poses by feeding them to a 2D to 3D pose estimator.
For this estimator we choose Gaussian process regression, which
has proved effective for non-linear regression problems. We
validate the whole pipeline on a publicly available pose
estimation dataset, HumanEva, and visualize two types of results:
enhanced 2D body part detections and 3D poses estimated from them.
Figure 1 shows the main steps of the algorithm as a pipeline:
feature extraction (MoP and background subtraction, in our case),
global optimization, and 2D to 3D pose estimation. From the input,
the original mixture-of-parts model is trained and applied to
detect body part positions. Meanwhile, human blobs extracted with
background subtraction are used as another cue. In the third step,
these two cues are combined by the proposed algorithm.

The generalized framework of our algorithm can incorporate
features other than human blobs. We extract multiple cues from the
input images and combine them under the proposed framework. With
augmented inputs from multiple features, we improve 2D body part
localization and address the double counting problem. As in the
MoP method, a human body is modeled by a tree structure that keeps
kinematic constraints between connected parts. First, models for
the different features are trained separately, and the
optimization target is modified into a global one that
incorporates multiple feature cues. Inference of the optimal pose
in a test image is also carried out with multiple cues. The
bottom-up message passing procedure combines multiple feature cues
to reduce false positive body part detections, and the top-down
backtracking procedure is optimized globally to tackle the double
counting problem.

The contributions of this paper are as follows: by combining
multiple cues, we boost 2D body part localization under a general
framework; the enhanced 2D body part detectors are validated on a
publicly available dataset, HumanEva; and 3D pose estimation is
shown as an exemplar application of the detected 2D poses, also
validated on HumanEva. The rest of the paper is organized as
follows: Section II introduces related work on 2D pose estimation
and on the double counting problem; Section III details the
proposed method, including training models from different features
and the global optimization target; Section IV shows the boosted
2D body part localization obtained by combining multiple feature
cues on a standard publicly available dataset, together with the
exemplar application of 3D pose estimation from the localized 2D
body parts; Section V concludes the work and discusses possible
future work.

Fig. 1. 2D body part detections and 3D pose estimation based on the enhanced Mixture of Parts method. The pipeline includes several major steps: feature
extraction (MoP and background subtraction, in our case), global optimization, and 2D to 3D pose estimation.

II. Related Work 

As mentioned in the previous section, the human body is highly
flexible, which results in a huge number of candidate poses in the
solution space. Human body joints are nevertheless constrained.
Models such as the pictorial structure model [?] and tree models
[?], [?], [?] have been successfully applied to represent the
human body in 2D. These models keep kinematic constraints between
connected body parts: parts that are physically connected are also
connected in the tree. Tree structures have the advantage of
tractable inference of the optimal pose. However, spatial
constraints between body parts without direct connections are not
represented in the tree. For this reason, the original tree
structure cannot deal with occlusion and suffers from double
counting, where one piece of image evidence is counted twice for
different body parts.

One important example of tree models is the mixture of parts
model [1]. MoP defines a body part as an area surrounding a body
joint, which helps with the foreshortening of body limbs caused by
viewpoint changes. In contrast, the traditional pictorial
structure (PS) model [?], which defines body parts as body limbs,
has to handle foreshortening by explicitly training on limbs of
different lengths. In the MoP model the orientation of a limb is
also naturally represented by the connection between detected body
joints, whereas in a traditional PS model limb orientations must
be learned and detected explicitly. We therefore choose MoP as the
human body model. A body part in the MoP model is represented as a
mixture of several templates, each trained on one subset of the
samples of that body part. In this way, the trained body part can
handle the different limb layouts that arise from different poses.


As mentioned in the first paragraph, although the tree-structured
human model is efficient in training and inference, it suffers
from double counting because of occlusions and the lack of
constraints between body parts that are not directly connected in
the tree. To deal with these problems, the authors of [4] propose
multiple tree models: one tree for kinematic constraints between
connected body parts, trees for spatial constraints among body
parts without direct connections, and trees for occluded body
parts. The different trees are combined with a boosting procedure.
Other work explores imposing constraints in the optimization
target. For example, the authors of [?] modify the optimization
target and incorporate spatial constraints to deal with double
counting. During inference, poses that violate the spatial
constraints receive a comparatively lower score.

III. The Method 

Given training images that each contain one human, we train 2D
body part detectors on image patches cropped within bounding boxes
surrounding the body parts. For a test image, we localize 2D body
part positions with the trained detectors and optimize the
detection with multiple feature cues. We call the detector the
enhanced MoP model since it is based on the MoP model proposed in
[1]. In the following subsections, we split the method into
modules and explain each in detail.

A. Mixture of Parts 

The idea of the mixture of parts detector in [1] is to represent
a body part with a mixture of several (5 or 6) templates, each
representing a different appearance of that body part. Body parts
whose appearance varies more, for example elbows and knees, are
assigned more templates. After cropping the image around the
bounding box with a proper size, all samples of a body part are
grouped into clusters, whose number is predefined according to the
variance of the body part. Training the templates is formulated as
optimizing the parameters of a support vector machine, which is
carried out with EQ optimization. Note that the size of the
bounding box is a crucial factor in adapting the method to custom
data. Because annotation conventions differ between datasets, body
joints may correspond to different positions. If the bounding box
is too big, it may contain information from other body parts; if
it is too small, it may lack the information needed to identify
the body part or joint.

After training templates for each body part, given a test image,
we compute the response of the image to every trained template by
convolution. A distance transform [?] is then performed so that
the maximum response of the image to each template is highlighted.
Next, we start from the leaves of the human tree structure (rooted
at the head) and pass the maximum responses over all mixtures from
each child body part to its parent. When we reach the root node,
all body part nodes have contributed by passing messages, and the
score of the root is taken as the final score of the human
detection. This tree structure is very efficient for inference,
but it cannot handle the double counting problem. In the following
subsections, we explain how we enhance the algorithm on top of the
original MoP model.
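The leaf-to-root pass and the subsequent backtracking can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the implementation of the original MoP method: mixtures are omitted, every part shares the same small set of candidate locations, and the names `max_sum_tree`, `children`, `unary` and `pair_cost` are hypothetical.

```python
def max_sum_tree(children, unary, pair_cost, root=0):
    """Max-sum dynamic programming over a tree of body parts.

    children[i]            -> list of child part ids of part i
    unary[i][l]            -> template response of part i at candidate location l
    pair_cost(i, j, li, lj) -> deformation penalty for parent i at li, child j at lj
    All names and the shared candidate set are illustrative assumptions.
    """
    n_loc = len(unary[root])
    score, argbest = {}, {}

    def upward(i):
        # Start from the part's own responses, then add each child's best message.
        total = list(unary[i])
        for j in children.get(i, []):
            upward(j)
            choice = []
            for li in range(n_loc):
                cands = [score[j][lj] - pair_cost(i, j, li, lj) for lj in range(n_loc)]
                k = max(range(n_loc), key=cands.__getitem__)
                total[li] += cands[k]
                choice.append(k)
            argbest[(i, j)] = choice
        score[i] = total

    upward(root)

    # Top-down backtracking: each child takes the location that was best
    # for the location already chosen for its parent.
    loc = {root: max(range(n_loc), key=score[root].__getitem__)}
    stack = [root]
    while stack:
        i = stack.pop()
        for j in children.get(i, []):
            loc[j] = argbest[(i, j)][loc[i]]
            stack.append(j)
    return score[root][loc[root]], loc


# Toy example: a head (0) with a left hand (1) and a right hand (2).
# Both hand templates respond most strongly at the same location 2.
children = {0: [1, 2]}
unary = {0: [1.0, 0.0, 0.0], 1: [0.1, 0.2, 0.9], 2: [0.1, 0.3, 0.9]}
best, loc = max_sum_tree(children, unary, lambda i, j, li, lj: 0.0)
```

In this toy case both hands are placed at location 2, because each child is chosen independently of its sibling. This is the double counting behaviour that the rest of the section sets out to remove.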

B. Enhanced MoP Via Multiple Cues Fusion 

Instead of imposing spatial constraints or modifying the tree
structure, we explore combining multiple cues from the input
images. We argue that multiple feature cues provide richer
information, so that combining them effectively reduces false
positives and eases the double counting problem. In our
experiments, we consider the histogram of gradient (HOG) [6] and
human blobs extracted by background subtraction [7].

1) Formulation of Enhanced Model: Let us write $I$ for an image,
$p_i = (x, y)$ for the pixel location of part $i$, and $t_i$ for
the mixture component of part $i$. We write $i \in \{1, \ldots, K\}$,
$p_i \in \{1, \ldots, L\}$ and $t_i \in \{1, \ldots, T\}$. We call
$t_i$ the “type” of part $i$. For notational convenience, we
define the lack of subscript to indicate a set spanned by that
subscript (e.g., $t = \{t_1, \ldots, t_K\}$). The kinematic
constraints between connected body parts are modeled as follows:

$$S(t) = \sum_{i \in V} b_i^{t_i} + \sum_{ij \in E} b_{ij}^{t_i, t_j}. \tag{1}$$

The parameter $b_i^{t_i}$ favors particular type assignments for
part $i$, while the pairwise parameter $b_{ij}^{t_i, t_j}$ favors
particular co-occurrences of part types. We write $G = (V, E)$ for
a $K$-node relational graph whose edges specify which pairs of
parts are constrained to have consistent relations.

We can now write the full score associated with a configuration of
part types and positions:

$$S(I, p, t) = S(t) + \sum_{i \in V} w_i^{t_i} \cdot \phi(I, p_i) + \sum_{ij \in E} w_{ij}^{t_i, t_j} \cdot \psi(p_i - p_j), \tag{2}$$

where $\phi(I, p_i)$ is a HOG vector extracted from pixel location
$p_i$ in image $I$, and $\psi(p_i - p_j) = [dx \; dx^2 \; dy \; dy^2]^T$,
where $dx = x_i - x_j$ and $dy = y_i - y_j$ give the relative
location of part $i$ with respect to $j$.


Up to this point, this is the original MoP model from [1]. Since
the features are extracted separately, we can train a model for
each candidate feature separately. For example, when extracted
human blobs are used as another feature cue, the human blob model
is obtained from background subtraction as follows:

$$F(p_i) = \begin{cases} 1, & \text{if } |F_t(p_i) - B(p_i)| > \text{threshold}, \\ 0, & \text{if } |F_t(p_i) - B(p_i)| \le \text{threshold}, \end{cases} \tag{3}$$

where $B(p_i)$ is the background model, which can be updated with
newly added frames in the following way:

$$B_{t+1}(p_i) = \alpha \cdot F_t(p_i) + (1 - \alpha) \cdot B_t(p_i), \tag{4}$$

where $\alpha$ is the learning rate. We denote this model as the
human blob (HB) model.
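The thresholded difference and the running-average update of the background model can be sketched as follows. This is only an illustration on grayscale images stored as nested lists; the function names and the default values of `alpha` and `threshold` are assumptions, not values taken from the paper.

```python
def update_background(background, frame, alpha=0.05):
    """Running average: B_{t+1} = alpha * F_t + (1 - alpha) * B_t, per pixel.

    alpha is the learning rate; 0.05 is an illustrative default.
    """
    return [[alpha * f + (1.0 - alpha) * b for f, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


def foreground_mask(frame, background, threshold=25.0):
    """Binary human-blob mask: 1 where the frame differs from the
    background by more than the threshold, 0 elsewhere."""
    return [[1 if abs(f - b) > threshold else 0 for f, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


# Usage on a single-row toy image.
background = [[10.0, 10.0]]
frame = [[100.0, 12.0]]
mask = foreground_mask(frame, background)            # [[1, 0]]
background = update_background(background, frame)    # slowly absorbs the frame
```

A small learning rate makes the background adapt slowly, so a person who keeps moving is not absorbed into the background model.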

After training the MoP model and the HB model separately, given a
test image, we need to find the optimal human pose with respect to
a criterion that takes both trained models into account. Since we
assume the image features are extracted separately, the joint
probability of matching the two models is

$$P(M, H) = P(M \mid H) \cdot P(H), \tag{5}$$

where $M$ represents the MoP model and $H$ represents the HB
model. This probabilistic formulation extends easily to other
image feature cues, given definitions of the model probabilities
and conditional probabilities.

In our method, we consider HOG and detected human blobs. We define
$P(H)$ for each pixel $m$ as follows:

$$P(H) = \begin{cases} \ldots, & \text{if } F_m = 1, \\ \ldots, & \text{if } F_m = 0. \end{cases} \tag{6}$$

Since the HOG feature and the human blob feature are extracted
separately, and the MoP model and the HB model are therefore
trained separately, $P(M \mid H)$ equals $P(M)$, which is defined
in equation (2). In the implementation, we calculate the
probability that a pixel $m$ belongs to a body part by convolving
the image evidence at this pixel with the trained body part
template.

2) Finding Root Positions: After training the MoP model and the
HB model separately from HOG and human blob features, we can
detect the human pose in an unseen image by finding the optimal
pose. In [1], the test image is convolved with each trained
template of the mixture of parts model. Then, starting from the
leaves of the tree structure, the responses of all body parts are
passed to their parent parts. After one pass, all body parts have
contributed their scores to the root part of the tree structure.

In our combined model, before passing the score from a child node
to its parent node, we check whether the pixel also agrees with
the evidence from human blob detection. If the current pixel
belongs to the detected human blob, we keep the current score;
otherwise the score is set to a very small value. This procedure
ensures that the final probability is the joint probability of the
two candidate feature models. Its advantage is that false
positives can be removed by verifying that the current pixels
agree with both models, which are trained from different image
feature cues, so the detected root position is more accurate.
After finding the root position of the human, we traverse the
whole tree to fix each body part with global optimization.
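The check against the human blob before a message is passed can be sketched as a simple gating step. The function name `gate_scores` and the value used for the "very small" score are assumptions made for illustration.

```python
def gate_scores(scores, mask, floor=-1e6):
    """Keep a part's response where the pixel lies inside the human blob
    (mask == 1); elsewhere replace it with a very small score.

    floor stands in for the "very small value"; -1e6 is an assumed choice.
    """
    return [[s if m == 1 else floor for s, m in zip(srow, mrow)]
            for srow, mrow in zip(scores, mask)]


# Usage: the second pixel is outside the blob, so its response is suppressed.
gated = gate_scores([[0.5, 0.9]], [[1, 0]])
```

Applied before every child-to-parent message, this keeps a strong HOG response on the background from propagating to the root.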
3) Finding Body Part Positions: From the detected root position
of the tree structure, the authors of [1] employ a backtracking
algorithm to fix all body part positions. It starts from the root
of the tree and fixes each child node by picking its maximum
response. This method causes the double counting problem. Since
each body part is fixed considering only the response of the test
image to the trained templates, when sibling body parts (the same
body part on different sides of the human body, like a left hand
and a right hand) resemble each other, the same image patch may be
picked repeatedly. In this case, the estimated pose appears
occluded although the real one is not.

To solve this problem, we use global optimization to fix each
body part position. In the original MoP model, where HOG is the
only feature, optimizing body part positions globally is very time
consuming. For example, if the model uses 26 body parts and each
body part uses 5 or 6 mixtures, the minimum number of possible
combinations over all body parts is $5^{25}$, a huge number of
candidates. In our case, where the human blob is considered as
another feature cue, the number of possible mixtures for each body
part is greatly reduced by this constraint, so we can optimize the
tree structure globally. The body part positions are optimized to
maximize the score $S_{mc}$ of the proposed model for combining
multiple cues:

$$S_{mc} = \sum_{i \in V} w_i^{t_i} \cdot \phi(I, p_i) \cdot F(p_i) + \sum_{ij \in E} w_{ij}^{t_i, t_j} \cdot \psi(p_i - p_j) \cdot F(p_i) \cdot F(p_j), \tag{7}$$

where $F(p_i)$ is the ratio of overlap between the body part at
$p_i$ and the foreground model defined in equation (3).
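The overlap ratio between a body part and the foreground can be computed directly from a binary mask. The sketch below assumes an axis-aligned bounding box given as `(x0, y0, x1, y1)` with exclusive upper bounds; the box convention and the function name are illustrative assumptions.

```python
def overlap_ratio(mask, box):
    """Fraction of pixels inside the bounding box that are foreground.

    mask is a binary image as nested lists (mask[y][x]); box is
    (x0, y0, x1, y1) with x1 and y1 exclusive (an assumed convention).
    """
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        return 0.0
    inside = sum(mask[y][x] for y in range(y0, y1) for x in range(x0, x1))
    return inside / area


# Usage: a 3x3 mask whose upper-right 2x2 block is foreground.
mask = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
half = overlap_ratio(mask, (0, 0, 2, 2))   # box straddles the blob boundary
full = overlap_ratio(mask, (1, 0, 3, 2))   # box lies entirely on the blob
```

Multiplying a template response by this ratio lowers the score of candidates that lie mostly on the background, which is the effect of the foreground terms in the combined score.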



Fig. 2. Body part localization with the Mixture of Parts (MoP)
model. The left figure shows the localizations of all body parts with MoP.
The right figure shows the overlap between the extracted human blobs and all
bounding boxes. Color variations in the human blobs denote the number of
overlapping bounding boxes.



C. From 2D Parts to 3D Pose Estimation 

The Gaussian process regressor is one of the most widely used
regression models for learning the 2D to 3D mapping in pose
estimation, since it has proved effective for this nonlinear
mapping problem [8], [?], [?]. Gaussian process regression (GPR)
is considered a model-free framework. Defined as a distribution
over functions, GPR extends statistics from data points to
functions. With the kernel trick, we can avoid defining the
function explicitly and concentrate on the kernel matrix instead.
Once the training input is normalized to have zero mean, we only
need to define a covariance matrix, that is, the kernel matrix,
for GPR. Frequently used covariance functions include the squared
exponential and the Matern covariance functions. In the following
subsections, we explain the representation and settings of the
Gaussian process regressor used here.

1) Definition of Gaussian Process Regression: According to [?], a
Gaussian process is defined as a collection of random variables,
any finite number of which have a (consistent) joint Gaussian
distribution. A Gaussian process is completely specified by its
mean function and covariance function. For our problem, we denote
the mean function as $m(s)$ and the covariance function as
$k(s, s')$, so a Gaussian process is represented as

$$\zeta(s) \sim \mathcal{GP}(m(s), k(s, s')), \tag{8}$$

where

$$m(s) = E[\zeta(s)], \qquad k(s, s') = E[(\zeta(s) - m(s))(\zeta(s') - m(s'))]. \tag{9}$$

2) Hyperparameter Optimization and Inference: We assume the
prediction noise follows a Gaussian distribution and formulate
finding the optimal hyperparameters as an optimization problem. We
seek the optimal hyperparameters $\theta$ by maximizing the log
marginal likelihood (see [?] for details):

$$\log p(\psi \mid s, \theta) = -\frac{1}{2} \psi^T K_\psi^{-1} \psi - \frac{1}{2} \log |K_\psi| - \frac{n}{2} \log 2\pi, \tag{10}$$

where $K_\psi$ is the calculated covariance matrix of the target
vector $\psi$ and $n$ is the number of training samples.

With the optimal hyperparameters, the predictive distribution is

$$\psi_{s_*} \sim \mathcal{N}\big(k(s_*, s)^T [K + \sigma_{noise}^2 I]^{-1} \psi,\; k(s_*, s_*) + \sigma_{noise}^2 - k(s_*, s)^T [K + \sigma_{noise}^2 I]^{-1} k(s_*, s)\big), \tag{11}$$

where $K$ is the covariance matrix calculated from the training
2D image features $s$ and $\sigma_{noise}^2$ is the variance of
the Gaussian noise.
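The predictive mean and variance of Eq. (11) can be evaluated with a few lines of linear algebra. The sketch below is a minimal pure-Python illustration for a single output dimension with a squared exponential kernel. The hyperparameters are fixed by hand rather than optimized with Eq. (10), and all function names and default values are assumptions.

```python
import math


def se_kernel(a, b, length=1.0, signal=1.0):
    """Squared exponential covariance between two feature vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return signal ** 2 * math.exp(-0.5 * d2 / length ** 2)


def cholesky(A):
    """Lower-triangular L with A = L L^T for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def chol_solve(L, b):
    """Solve (L L^T) x = b by forward and then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(S, psi, s_star, noise=0.1, length=1.0):
    """Predictive mean and variance for one output dimension.

    S: training inputs, psi: training targets, s_star: test input.
    noise is the noise standard deviation (an assumed fixed value).
    """
    n = len(S)
    K = [[se_kernel(S[i], S[j], length) + (noise ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    k_star = [se_kernel(s, s_star, length) for s in S]
    mean = sum(k * a for k, a in zip(k_star, chol_solve(L, psi)))
    v = chol_solve(L, k_star)
    var = (se_kernel(s_star, s_star, length) + noise ** 2
           - sum(k * w for k, w in zip(k_star, v)))
    return mean, var


# Usage: three 1D training points; predict at a seen and at a distant input.
S = [[0.0], [1.0], [2.0]]
psi = [0.0, 1.0, 4.0]
near_mean, near_var = gp_predict(S, psi, [1.0], noise=1e-3)
far_mean, far_var = gp_predict(S, psi, [100.0], noise=1e-3)
```

At a training input the predictive variance is close to the noise level, while far from the data the prediction falls back to the zero prior mean with the prior variance plus noise.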

Equation (11) for inference on test data is derived from the
marginal and conditional properties of Gaussian distributions. The
marginal property states that the marginal of a joint Gaussian is
again Gaussian, that is,

$$p(x, y) = \mathcal{N}\left(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}\right) \;\Rightarrow\; p(x) = \mathcal{N}(a, A). \tag{12}$$

Fig. 3. Body part localization after optimization (left) and final overlap map
between all bounding boxes and the extracted human blob.


The conditional property states that the conditionals of a joint
Gaussian are again Gaussian, that is,

$$p(x, y) = \mathcal{N}\left(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}\right) \;\Rightarrow\; p(x \mid y) = \mathcal{N}\big(a + B C^{-1} (y - b),\; A - B C^{-1} B^T\big). \tag{13}$$

Thus we are able to predict the distribution of $x$ given an
observed $y$.

In most cases, we assume that Gaussian process priors have zero
mean, that is,

$$f(x) \sim \mathcal{GP}(m(x) = 0, k(x, x')). \tag{14}$$

This leads to a Gaussian process posterior

$$f(x) \mid \mathbf{x}, \mathbf{y}, M_i \sim \mathcal{GP}(m_{post}(x), k_{post}(x, x')), \tag{15}$$

where

$$m_{post}(x) = k(x, \mathbf{x}) [K(\mathbf{x}, \mathbf{x}) + \sigma_{noise}^2 I]^{-1} \mathbf{y}. \tag{16}$$

With this posterior, we only need to define the covariance
matrix, known as the kernel in the machine learning community.

The most frequently used covariance functions (kernels) include
the squared exponential (SE), rational quadratic (RQ), Matern, and
periodic covariance functions. The role of the covariance function
is to define the distance measure in a transformed space in which
the original data samples have one-to-one correspondences with
their mapped points and, owing to the transformation, samples of
different classes are easier to separate. With the kernel trick,
we avoid defining the mapping directly and only define the kernel
matrix, which is the covariance matrix here.

3) GPR for 2D to 3D Pose Mapping: Gaussian processes specify a
probability distribution over functions through a mean and a
covariance function for the function values $f(x)$. After training
a Gaussian process with sample data $\{x, f(x)\}$, the variance of
the process becomes small at supporting points $x$ included in the
training data, which corresponds to increased certainty about the
function values there. At other points $x'$ the variance remains
high, which corresponds to high uncertainty about the function
values $f(x')$.

In our algorithm, we select the most commonly used covariance
function, the squared exponential. Given a 2D pose estimate
represented as a 26-dimensional vector BP (13 * 2, where 13 is the
number of body joints in MoP and 2 is the dimension of each
joint), we train one Gaussian process for each of the 60
dimensions of the 3D pose vector $\psi$ (20 * 3, where 20 is the
number of body joints in the HumanEva motion capture data and 3 is
the dimension of each joint). Then, given features from test
samples, we predict 3D poses with the trained GPR.


Exp.   Action    Actor   TrFrmNo   TeFrmNo
1      Walking   S1      200       21
2      Walking   S2      200       21
3      Walking   S3      200       21
4      Box       S1      200       21
5      Box       S2      200       21
6      Box       S3      200       21

TABLE I

The composition of experiments from the HumanEva data set.
“TrFrmNo” denotes the number of training frames.
“TeFrmNo” denotes the number of test frames.


IV. Results 

To demonstrate the effectiveness of the proposed method, we first
show the improved 2D body part localization obtained with the
proposed feature fusion method. Then the trained 2D to 3D pose
estimator is applied to the detected 2D body part locations, and
the estimated 3D poses are shown.

A. Evaluation Data and Experiment Settings 

From the HumanEva-I data set, we select two actions (“Walking”
and “Box”) performed by three actors (“S1”, “S2” and “S3”). All
three actors perform the actions within a fixed area delimited by
a carpet. “Walking” is performed cyclically, while in “Box” the
actors move within a very small area, notwithstanding their
different performing styles.

As a result, we have six experiments in total. Training and
testing are carried out within each experiment: we train a
detector on the training frames of one experiment and validate the
trained models on the test frames of the same experiment. This
setting is designed to compare the influence of action type and
actor on the body part localization and pose estimation results.
The detailed split between training and test data is shown in
Table I.

For each action performed by a specific actor, the training data
consist of 200 consecutive frames, which is close to the number of
frames in one cyclic walking sequence. Test data are sampled with
an equal step over the whole motion sequence, excluding the
training frames, so that the test poses cover the possible poses
of an action.
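The split can be sketched as follows. The sequence length, the position of the training block and the function name `split_frames` are assumptions made for illustration; only the counts of 200 training frames and 21 test frames come from Table I.

```python
def split_frames(n_frames, train_start, n_train=200, n_test=21):
    """Return (train, test) frame indices for one action sequence.

    train: n_train consecutive frames starting at train_start.
    test:  n_test frames sampled with an equal step from the remaining frames.
    """
    train = list(range(train_start, train_start + n_train))
    rest = [f for f in range(n_frames)
            if f < train_start or f >= train_start + n_train]
    if len(rest) < n_test:
        raise ValueError("not enough frames left for the test set")
    step = len(rest) / n_test
    test = [rest[int(i * step)] for i in range(n_test)]
    return train, test


# Usage with a hypothetical 620-frame sequence whose first 200 frames are
# used for training; the test frames are then 200, 220, ..., 600.
train, test = split_frames(620, train_start=0)
```

Sampling from the frames that remain after removing the training block keeps the two sets disjoint while still covering the rest of the sequence evenly.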

B. Enhanced 2D Part Locations 

The proposed 2D part localization method aims to solve the double
counting problem, in which a pixel location (a body part location
in our case) is assigned to two body limb positions even though
there is no occlusion between these limbs. Double counting arises
in MoP detection for the following reasons:

1) the responses of a pixel location (or a body part observation)
to all trained body part templates are calculated separately;

2) then, from the leaf nodes to the root node, the best response
among all candidate mixtures is selected for each node and passed
as a message to its parent node, which is a locally optimal
solution.




Fig. 4. Qualitative 2D pose estimation samples. Column “MoP detection” shows detected 2D body parts from the original MoP method. Column “Our
detection” shows the enhanced 2D body part detection from our method.


The limitation of this solution is that the chain of body part
positions computed from local optima is usually not a global
optimum, and the procedure provides no global optimization target.

In our solution, we introduce another feature cue (extracted
human blobs in the current experiments). In this way, not only can
the localization be verified by two features, but the target can
also be optimized globally, because the newly introduced feature
cue gives a global description. The main modules are the
following:

• We keep the response scores from all mixtures of the current
body part for later use. Instead of selecting the mixture
with the maximum response as in [1], we take a predefined
set of body parts, put the candidates whose overlap with
the other feature exceeds a certain ratio in a candidate
list, and compute the configuration that maximizes the
overlap of the two features. We restrict the optimization
to a predefined set of body parts because considering all
of them would be too expensive
(5 * 5 * 5 * 6 * 6 * 5 * 6 * 6 * 5 * 6 * 6 * 5 * 6 * 6
configurations in our method, where we use fourteen body
parts with five or six mixtures each) and redundant, since
not all body parts are prone to double counting. The body
parts that are prone to double counting are the two elbows
(left and right), the two hands, the two knees and the two
legs. So we predefine a set of body parts to be optimized.

• We add a mixture-selection module, in which any mixture
whose overlap ratio exceeds a threshold (thresh2 = 0.2 in
our experiments, because the extracted silhouettes are
noisy) is considered a candidate mixture that can pass
messages to its parent.

• We optimize the position of a body part among all its kept
candidate positions while fixing all other body part
positions. If the overlap of the two feature cues at a
pixel location is within an interval (thresh1 = 0.5 in our
experiments), then for every pair of body parts that might
be double counted (that is, the two elbows, the two hands,
the two knees and the two legs), we check whether they
overlap. If they do, they may be double counted and are
added to the candidate list for local optimization.
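The size of the full search space and the effect of the mixture-selection threshold can be checked with a short sketch. The mixture counts are those listed above; the overlap ratios in the usage example are made-up values, and both function names are hypothetical.

```python
from math import prod


def n_configurations(mixtures_per_part):
    """Number of joint mixture assignments when every part is free."""
    return prod(mixtures_per_part)


def keep_candidates(overlaps, thresh=0.2):
    """Indices of the mixtures whose overlap ratio with the human blob
    exceeds thresh (0.2 is the thresh2 value quoted in the text)."""
    return [k for k, r in enumerate(overlaps) if r > thresh]


# Fourteen body parts with five or six mixtures each, as listed above.
mixtures = [5, 5, 5, 6, 6, 5, 6, 6, 5, 6, 6, 5, 6, 6]
full = n_configurations(mixtures)          # 5**6 * 6**8 = 26,244,000,000

# Hypothetical overlap ratios for the five mixtures of one part:
# only two mixtures survive the threshold.
kept = keep_candidates([0.1, 0.25, 0.8, 0.2, 0.0])
```

With roughly 2.6 * 10^10 joint assignments, exhaustive search over all parts is impractical, which is why pruning by overlap and restricting the search to the parts prone to double counting matter.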

Note that in our experiments the parameters are set empirically.
It is also straightforward to obtain them from training data. For
example, we can calculate all overlap ratios between the bounding
boxes of training body parts and the extracted human blobs, fit a
Gaussian distribution, and take the mean of the fitted Gaussian as
thresh2. Body part localization results are shown in Figure 4.
From the figure, we can see that double counted body parts are
correctly localized after optimization.
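Estimating the threshold from training data amounts to a maximum likelihood Gaussian fit of the observed overlap ratios. The sketch below assumes the ratios have already been computed; the function name and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def fit_overlap_threshold(ratios):
    """Maximum likelihood Gaussian fit of overlap ratios.

    Returns (mu, sigma); mu is the candidate value for thresh2.
    """
    mu = mean(ratios)
    sigma = pstdev(ratios, mu)   # population standard deviation (ML estimate)
    return mu, sigma


# Usage with made-up overlap ratios from training bounding boxes.
thresh2, spread = fit_overlap_threshold([0.1, 0.2, 0.3])
```

The standard deviation is returned as well because it indicates how noisy the extracted silhouettes are, although the text only uses the mean.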

C. 3D Pose Estimations 

We further feed the enhanced 2D body part locations to the pose
estimators to predict 3D poses. For each experiment, we train a
set of Gaussian processes with the squared exponential covariance
function on the training set, apply the proposed 2D body part
detectors to the test images, and feed the detected 2D body parts
to the trained Gaussian process regressors to obtain 3D pose
estimates. Here we show some visualizations of the 3D pose
estimates. Figure 5 shows examples from the walking and box
actions.

For a qualitative comparison, Figure 6 shows the 3D joint
positions of the ground truth pose and the estimated 3D pose. Both
plots show values of the first dimension of the 3D joints. The
plot in the first row is from the left elbow of actor “S2”
performing “Walking”, and the plot in the second row is from the
left hand of actor “S2” performing “Walking”.

Fig. 5. Examples of visualized 3D pose estimation. The first row is a frame
from actor S1 performing walking. The second row is a frame from actor S1
performing box. The third row is a frame from actor S2 performing box. The
stick figure on the left is the ground truth data, and the stick figure on the
right is the estimated 3D pose from localized 2D body part positions.

One direct application of the proposed method is controlling the
3D pose of an avatar. Motion capture systems usually require
invasive body markers, whereas in our pipeline performers no
longer need them once the training 3D poses have been acquired.
Moreover, we only need image sequences from a single viewpoint.

V. Conclusions and Discussions 

In this paper, we design an algorithm to enhance the performance
of 2D body part localization based on Mixture of Parts models,
which recently achieved good performance in 2D body part
localization. We then take the estimated poses as input to
estimate 3D poses. We validate our method in two ways: visualized
2D body part localization results and 3D pose estimation errors.
One interesting direction for future work is to incorporate
physical constraints into the 3D human model so that 2D body parts
can be optimized accordingly. We are also interested in validating
this method on other public data sets, such as the YouTube data
set, where 3D pose ground truth is not provided.

Fig. 6. The first dimension (values along the x axis) of the left elbow position
estimations (in green) and ground truth joint positions (in blue). The first
figure is from actor S2 performing walking. The second figure is also
from actor S2 performing walking. The x axis denotes the frame id, ranging
from 1 to 21. The y axis denotes values of the first dimension of the 3D joint
positions, in millimeters.